Version control is a system to systematically record changes to files in a project, in such a way that it is always possible to recall specific versions of the files at any point in the life of the project.
There are different reasons why version control is a valuable tool in reproducible research.
Having versions of a project is useful because it prevents error propagation. For instance, if changes to a coding project introduce an error (e.g., the analysis fails, the code breaks, etc.), it is always possible to retrieve the files from a previous version, saved before the questionable changes were made.
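As a minimal sketch of this idea, the code below records a working script, breaks it, and then restores the recorded version. It assumes that the git command-line tool is installed; the folder, the file name analysis.R, the placeholder identity, and the helper git() are all made up for this illustration.

```r
# Throwaway folder for the demo (assumes the git command-line tool is installed).
repo <- tempfile("restore-demo-")
dir.create(repo)

# Illustrative helper (not from any package): run a git command inside `repo`
# with a placeholder identity so that commits work on any machine.
git <- function(...) {
  args <- c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com", ...)
  system2("git", args, stdout = TRUE)
}

script <- file.path(repo, "analysis.R")
git("init")

# Record a working version of the script.
writeLines("result <- mean(c(1, 2, 3))", script)
git("add", "analysis.R")
git("commit", "-m", shQuote("Working analysis"))

# Introduce an error: the edited script no longer parses.
writeLines("result <- mean(c(1, 2, ", script)

# Discard the broken edit by restoring the file from the last commit.
git("checkout", "HEAD", "--", "analysis.R")
restored <- readLines(script)
```

After the last step, the file on disk should again hold the committed, working line of code.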
Version control also allows projects to grow. For instance, there might be some common elements in different projects, say, data preprocessing, after which analysis can differ. Version control would allow multiple projects to branch from that common root with ease.
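The sketch below illustrates branching from a common root. It again assumes that the git command-line tool is installed; the file preprocess.R, the branch names project-a and project-b, and the helper git() are hypothetical.

```r
# Throwaway folder for the demo (assumes the git command-line tool is installed).
repo <- tempfile("branch-demo-")
dir.create(repo)

# Illustrative helper (not from any package): run a git command inside `repo`
# with a placeholder identity so that commits work on any machine.
git <- function(...) {
  args <- c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com", ...)
  system2("git", args, stdout = TRUE)
}

git("init")

# The common root: a preprocessing step shared by both projects.
writeLines("clean_data <- function(x) x[!is.na(x)]",
           file.path(repo, "preprocess.R"))
git("add", "preprocess.R")
git("commit", "-m", shQuote("Shared preprocessing"))

# Two hypothetical projects branch off from the same commit.
git("branch", "project-a")
git("branch", "project-b")

# List the branch names: the default branch plus the two new ones.
branches <- git("branch", shQuote("--format=%(refname:short)"))
```

Each branch can now receive its own commits without affecting the other, while both keep the shared preprocessing in their history.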
Version control can be done locally.
In this case, files are stored in a local system and are identified by versions. The user can select which version to work with. A very simple way of doing version control is by naming files according to progression. Many of us will call a file “Document-Sept-19-2019”, and then when we work on it again and make changes, save the new file as “Document-Oct-15-2019”. This is a rudimentary and inefficient way of doing version control, given the proliferation of files and the difficulty of keeping track of changes.
Version control systems instead keep a database of files by version, while keeping track of changes.
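To contrast this with dated file names, here is a small sketch in which a single file keeps one name while Git stores its history. It assumes that the git command-line tool is installed; the file Document.txt, the commit messages, and the helper git() are invented for the example.

```r
# Throwaway folder for the demo (assumes the git command-line tool is installed).
repo <- tempfile("history-demo-")
dir.create(repo)

# Illustrative helper (not from any package): run a git command inside `repo`
# with a placeholder identity so that commits work on any machine.
git <- function(...) {
  args <- c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com", ...)
  system2("git", args, stdout = TRUE)
}

doc <- file.path(repo, "Document.txt")
git("init")

# First version of the document.
writeLines("First draft.", doc)
git("add", "Document.txt")
git("commit", "-m", shQuote("First draft"))

# Second version: same file name, new content, new entry in the history.
writeLines(c("First draft.", "Revised in October."), doc)
git("commit", "-a", "-m", shQuote("October revision"))

# One line per recorded version, most recent first.
history <- git("log", "--oneline")
```

The folder contains a single Document.txt, yet both versions remain retrievable from the history.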
Version control systems can also be centralized.
In this case, there is a version control server that serves versions of the project to one or more users, who also keep the project locally in their systems. An advantage of a centralized version control system is that it automatically provides a backup of the project on the server. Another advantage is that a centralized system makes it easy to collaborate with others - there will be more on this at a later session.
In this session we will introduce some basic concepts and tools of version control.
Version control is implemented in a systematic way by different software applications, including Perforce, Mercurial, and Git. Hosting services such as Bitbucket and GitHub build on these applications. For this seminar, we will use GitHub, which is based on Git and will act as our central server.
Git is a version control system that was created by Linus Torvalds (of Linux fame) in 2005, after the source-control management system that he and his collaborators had been using to develop the Linux kernel (BitKeeper) stopped being available to them free of charge.
In British English slang, “git” means a contemptible and unpleasant person, and Torvalds himself described the project as “the stupid content tracker”. Git, like Linux, is a free and open-source project.
GitHub is a hosting service for version control based on Git. It implements the functionality of Git and complements it with additional features. As of 2019 it is reported to have 38 million users and at least 28 million public repositories, which makes it the world’s largest host of source code - although source code is only one class of information that GitHub serves.
For this seminar you will need to create an account with GitHub. An account can be created for free, and it offers full functionality with public repositories. Free accounts have also been able to keep private repositories since 2019, so a paid GitHub Pro account is not needed for the purpose of this seminar.
If you do not have an account yet, please create one now.
Now that you have a GitHub account you can set your system up so that it can communicate with GitHub.
Go to your profile icon and navigate to “Settings”.
Once there, navigate to “Developer Settings”.
Under “Personal access tokens”, choose “Tokens (classic)”. A token is an identification tool: it identifies your system to GitHub in place of a password.
Choose the option to generate a new token. I usually generate a token for each of my systems (e.g., my laptop, my desktop, etc.), but it is also possible to generate fine-grained tokens that are specific to a repository. I have not yet seen the need for that in my work.
The note allows you to use text to remind you what this token is for. I set the expiration to 30 days so that I must regenerate the token frequently. This is good for security, but for me it is also important to keep the process fresh in my memory.
The scopes are the permissions that this token grants: Can it manipulate repositories? Can it delete them? And so on. Go on, give this token all permissions, but remember that it is very important that you keep the token safe: do not share it or otherwise allow it to become compromised.
Your new token will be displayed. Copy it now and use it to authenticate your system; GitHub will not show it again once you leave the page.
To authenticate your system, go to R Studio. You will need package {gitcreds}, which handles git credentials from R. Use:
gitcreds::gitcreds_set()
Then choose the option to set or replace your credentials. At the prompt, paste the token that you generated on GitHub. Now GitHub will know that this system is allowed to interact with it in your name. Remember to navigate away from the GitHub page where your token was displayed.
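If you want to confirm that the token was stored, a check along the following lines may help. This is only a sketch: it assumes that {gitcreds} is installed and that you authenticate against https://github.com, and it prints the stored user name rather than the token itself.

```r
# Look up the credential that git would use for GitHub.
# If none is found (or {gitcreds} is not installed), `cred` is set to NULL.
cred <- tryCatch(
  gitcreds::gitcreds_get(url = "https://github.com"),
  error = function(e) NULL
)

if (is.null(cred)) {
  message("No GitHub credential found; try gitcreds::gitcreds_set() again.")
} else {
  # Show only the user name field; never print the token.
  message("A credential is stored under the user name: ", cred$username)
}
```

Another option is usethis::git_sitrep(), which reports on the Git and GitHub setup of your system.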
Once you have a token, you can regenerate it by following the steps above: navigate to “Settings” > “Developer Settings” > “Tokens (Classic)” and then choose the token you wish to regenerate. Then use gitcreds::gitcreds_set() to replace the previous token with the new one.
Let us quickly tour GitHub.
The starting point for using version control is to create a repository. You can do this directly on GitHub.
You will need to choose a name for the repository and select its settings.
The settings allow you to create a README.md document immediately, so that the repository is not empty.
As an alternative, you can do this from R Studio. After creating a
project, use usethis::use_git() and
usethis::use_github(). The first function will initialize a
repository locally. The second will push it to GitHub.
Any text file in a repository can be edited directly on GitHub using the online editor.
Instead of (or in addition to) credentialling your system with GitHub from R, you can use an app called GitHub Desktop, which can be downloaded from GitHub.
This app allows you to interact with GitHub from your local system.
Create a new R Studio project and name it “my-GEO712-repository”.
Initialize a reproducible environment and install packages as needed.
Use package {usethis} to initialize git for the repository (usethis::use_git()) and then set it up to work with GitHub (usethis::use_github()).
Create a README.Rmd file (usethis::use_readme_rmd()) and transfer the contents of your previous activity. Alternatively, copy the file to this repository.
Use R Studio to edit the README.Rmd file; use markdown syntax to introduce your repository (don’t forget to knit when you are done).
Commit your changes, and then push them to the GitHub repository.
Once you have done this, go to GitHub and in the repository settings look for “Collaborators”. Invite me to collaborate (my GitHub handle is paezha).
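In R Studio, committing and pushing are usually done from the Git pane. For those curious about what happens underneath, the sketch below walks through the same commit-and-push cycle in a sandbox. It assumes that the git command-line tool is installed; a bare repository in a temporary folder stands in for GitHub, and the README contents, the placeholder identity, and the helper git() are hypothetical.

```r
# Stand-in for GitHub: a bare repository in a temporary folder (demo assumption).
remote <- tempfile("fake-github-")
system2("git", c("init", "--bare", shQuote(remote)), stdout = TRUE)

# The local project folder.
repo <- tempfile("my-GEO712-repository-")
dir.create(repo)

# Illustrative helper (not from any package): run a git command inside `repo`
# with a placeholder identity so that commits work on any machine.
git <- function(...) {
  args <- c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com", ...)
  system2("git", args, stdout = TRUE)
}

git("init")
git("remote", "add", "origin", shQuote(remote))

# Edit a file, stage it, and commit the change locally.
writeLines("# my-GEO712-repository", file.path(repo, "README.md"))
git("add", "README.md")
git("commit", "-m", shQuote("Add README"))

# Push the current branch to the stand-in remote.
branch <- git("rev-parse", "--abbrev-ref", "HEAD")
git("push", "origin", branch)

# The remote should now hold the same commit as the local repository.
local_id <- git("rev-parse", "HEAD")
remote_id <- system2("git", c("-C", shQuote(remote), "rev-parse", branch),
                     stdout = TRUE)
```

With a real project, the remote would be the GitHub repository created by usethis::use_github() rather than a local folder.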
Please sign up for DMP Assistant at https://assistant.portagenetwork.ca/?locale=en